ADAPTATION OF A SPEECH RECOGNITION SYSTEM ACROSS MULTIPLE
REMOTE SESSIONS WITH A SPEAKER



Field of the Invention 

The present invention relates to the field of speech recognition. More particularly, the
present invention relates to the field of adaptation of a speech recognition system across
multiple remote sessions with a speaker.



Background of the Invention 
Speech recognition systems are known which permit a user to interface with a
computer system using spoken language. The speech recognition system receives spoken
input from the user, interprets the input, and then translates the input into a form that the
computer system understands.

Speech recognition systems typically recognize spoken words or utterances based upon
an acoustic model of a person who is speaking (the speaker). Acoustic models are typically
generated based upon samples of speech. When the acoustic model is constructed based upon
samples of speech obtained from a number of persons rather than a specific speaker, this is
called speaker-independent modeling. When a speaker-independent model is then modified
for recognizing speech of a particular person based upon samples of that person's speech, this
is called adaptive modeling. When a model is constructed based solely on the speech of a
particular person, this is termed speaker-dependent modeling.

Speaker-independent modeling generally enables a number of speakers to interface
with the same recognition system without having obtained prior samples of the speech of the
particular speakers. In comparison to speaker-independent modeling, adaptive modeling and
speaker-dependent modeling generally enable a speech recognition system to more accurately
recognize a speaker's speech, especially if the speaker has a strong accent, has a phone line
which produces unusual channel characteristics, or for some other reason is not well modeled
by speaker-independent models.

Fig. 1 illustrates a plurality of speaker-dependent acoustic models M1, M2, and Mn in
accordance with the prior art. For each speaker, 1 through n, a corresponding
speaker-dependent acoustic model M1 through Mn is stored. Thus, speech 10 of speaker 1 is
recognized using the model M1 and the results 12 are outputted. Similarly, speech 14 of
speaker 2 is recognized using the model M2 and the results 16 are outputted. And, speech 18
of speaker n is recognized using the model Mn and the results are outputted.

A speech recognition application program called NaturallySpeaking™, which adapts to
a particular user, is available from Dragon Systems, Inc. This application program enables a
user to enter text into a written document by speaking the words to be entered into a
microphone attached to the user's computer system. The spoken words are interpreted and
translated into typographical characters which then appear in the written document displayed
on the user's computer screen. To adapt the application program to the particular user and to
background noises of his or her environment, the user is asked to complete two initial training
sessions during which the user is prompted to read textual passages aloud. A first training
session requires that the user read several paragraphs aloud, while a second training session
requires 25 to 30 minutes for speaking and 15 to 20 minutes for processing the speech.

Other speech recognition systems are known which adapt to an individual speaker
based upon samples of speech obtained while the speaker is using the system, without
requiring a training session. The effectiveness of this type of adaptation, however, is
diminished when only a small sample of speech is available.

Speech recognition systems are known which provide a telephonic interface between a
caller and a customer service application. For example, the caller may obtain information
regarding flight availability and pricing for a particular airline and may purchase tickets
utilizing spoken language and without requiring assistance from an airline reservations clerk.
Such customer service applications are typically intended to be accessed by a diverse
population of callers and with various background noises. In such applications, it would be
impractical to ask the callers to engage in a training session prior to using the customer
service application. Accordingly, an acoustic model utilized for such customer service
applications must be generalized so as to account for variability in the speakers. Thus,
speaker-independent modeling is utilized for customer service applications. A result of using
speaker-independent modeling is that the recognition system is less accurate than may be
desired. This is particularly true for speakers with strong accents and those who have a phone
line which produces unusual channel characteristics.

Therefore, what is needed is a technique for improving the accuracy of speech 
recognition for a speech recognition system. 


Summary of the Invention 
The invention is a method and apparatus for adaptation of a speech recognition system
across multiple remote sessions with a speaker. The speaker can remotely access a speech
recognition system, such as via a telephone or other remote communication system. An
acoustic model is utilized for recognizing speech utterances made by the speaker. Upon
initiation of a first remote session with the speaker, the acoustic model is speaker-independent.
During the first remote session, the speaker is uniquely identified and speech samples are
obtained from the speaker. In the preferred embodiment, the samples are obtained without
requiring the speaker to engage in a training session. The acoustic model is then modified
based upon the samples thereby forming a modified model. The model can be modified
during the remote session or after the session is terminated. Upon termination of the remote
session, the modified model is then stored in association with an identification of the speaker.
Alternately, rather than storing the modified model, statistics that can be used to modify a
pre-existing acoustic model are stored in association with an identification of the speaker.
During a subsequent remote session, the speaker is identified and, then, the modified
acoustic model is utilized to recognize speech utterances made by the speaker. Additional
speech samples are obtained during the subsequent session and, then, utilized to further
modify the acoustic model. In this manner, an acoustic model utilized for recognizing the
speech of a particular speaker is cumulatively modified according to speech samples obtained
during multiple remote sessions with the speaker. As a result, the accuracy of the speech
recognizing system improves for the speaker even when the speaker only engages in relatively
short remote sessions.

For each speaker who remotely accesses the speech recognizing system, a modified
acoustic model, or a set of statistics that can be used to modify the acoustic model or
incoming acoustic speech, is formed and stored along with the speaker's unique identification.
Accordingly, multiple different acoustic models or sets of statistics are stored, one for each
speaker.

Brief Description of the Drawings 

Fig. 1 illustrates a plurality of speaker-dependent acoustic models in accordance with 
the prior art. 

Fig. 2 illustrates a speech recognizing system in conjunction with a remote 
communication system in accordance with the present invention. 

Fig. 3 illustrates a flow diagram for adapting an acoustic model utilized for speech 
recognition in accordance with the present invention. 

Fig. 4 illustrates a plurality of sets of transform statistics for use in conjunction with 
an acoustic model in accordance with the present invention. 

Fig. 5 illustrates a flow diagram for adapting an acoustic model utilized for speech 
recognition in accordance with an alternate embodiment of the present invention. 

Detailed Description of a Preferred Embodiment 

Fig. 2 illustrates a speech recognizing system 100 in conjunction with a remote
communication system 150 in accordance with the present invention. The remote
communication system 150 can be a telephone system (e.g., a central office, a private branch
exchange or cellular telephone system). Alternately, the remote communication system 150
can be a communication network (e.g., a wireless network), a local area network (e.g., an
Ethernet LAN) or a wide area network (e.g., the World Wide Web). The speech recognition
system 100 includes a processing system, such as a general purpose processor 102, a system
memory 104, a mass storage medium 106, and input/output devices 108, all of which are
interconnected by a system bus 110. The processor 102 operates in accordance with machine
readable computer software code stored in the system memory 104 and mass storage medium
106 so as to implement the present invention. The input/output devices 108 can include a
display monitor, a keyboard and an interface coupled to the remote communication system 150
for receiving speech input therefrom. Though the speech recognizing system 100 illustrated in
Fig. 2 is implemented as a general purpose computer, it will be apparent that the speech
recognizing system can be implemented so as to include a special-purpose computer or
dedicated hardware circuits, in which case one or more of the hardware elements illustrated
in Fig. 2 can be omitted or substituted by another.
The invention is a method and apparatus for adaptation of a speech recognizing system
across multiple remote sessions with a speaker. Fig. 3 illustrates a flow diagram for adapting
an acoustic model utilized for speech recognition in accordance with the present invention.
The flow diagram of Fig. 3 illustrates graphically operation of the speech recognizing system
100 in accordance with the present invention. Program flow begins in a start state 200. From
the state 200, program flow moves to a state 202. In the state 202, a remote session between
the speaker and the speech recognition system 100 is initiated. For example, a telephone call
placed by the speaker initiates the session, in which case the speaker is a telephone caller.
Alternately, the remote session is conducted via another remote communication medium.
Then, program flow moves to a state 204.
In the state 204, an identification of the speaker is obtained. For example, the speaker
can be prompted to speak his or her name, enter a personal identification number (PIN), enter
an account number, or the like. Alternately, the speaker can be automatically identified, such
as by receiving the speaker's caller ID for a telephone call. The speaker's identification can
also be authenticated utilizing voice identification techniques assuming a voice sample of the
speaker has previously been obtained by the speech recognition system 100. From the state
204, program flow moves to a state 206. In the state 206, a determination is made as to
whether the particular speaker is a first-time speaker or if samples of the speaker's speech
have been previously obtained. This is accomplished by attempting to match the speaker's
identification obtained in the state 204 to a prior entry stored in the memory 104 or mass
storage 106 of the speech recognizing system 100 made in response to a prior session with the
same speaker. It will be apparent that the prior entries can also be stored remotely from the
speech recognition system 100, such as in a centralized database which is accessible to the
speech recognition system 100 via a network connection which can be provided by a local
area network or the World Wide Web.
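
By way of illustration only, the determination made in the state 206 can be sketched as a lookup keyed by the speaker's identification. The sketch below is a minimal example under stated assumptions: the names `SpeakerStore`, `lookup` and `is_first_time_speaker`, the in-memory dictionary, and the sample identifier are all hypothetical and are not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SpeakerStore:
    """Hypothetical store of prior entries, keyed by speaker identification."""
    entries: Dict[str, dict] = field(default_factory=dict)

    def lookup(self, speaker_id: str) -> Optional[dict]:
        # Return the prior entry for this speaker, or None if there is none.
        return self.entries.get(speaker_id)


def is_first_time_speaker(store: SpeakerStore, speaker_id: str) -> bool:
    # State 206: a speaker with no prior entry is treated as a first-time speaker.
    return store.lookup(speaker_id) is None


# Example: one speaker already has an entry from a prior session.
store = SpeakerStore()
store.entries["account-1234"] = {"statistics": [0.1, -0.2]}
```

The same interface could equally be backed by a centralized database reached over a network connection; the dictionary is used here only to keep the example self-contained.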

Assuming the speaker is a first-time speaker, program flow moves from the state 206
to a state 208. In the state 208, a speaker-independent model is retrieved from the memory
104 or mass storage 106 to be utilized for recognizing speech made by the speaker. The
speaker-independent model is a generalized acoustic model generated based upon samples of
speech taken from multiple different representative persons.
The program flow then moves to a state 210. In the state 210, the speaker-independent
acoustic model retrieved in the state 208 is utilized for recognizing speech made
by the speaker as the speaker interacts with the speech recognition system 100 during the
remote session. For example, the speaker-independent model is utilized to recognize when the
speaker wishes to obtain a flight schedule, a bank account balance, and so forth. In addition,
during the state 210 samples of the speaker's speech are taken. Preferably, these samples are
taken without prompting the speaker to speak certain words or phrases, as in a training
session. It will be apparent, however, that the speaker can be prompted to speak certain
words or phrases. In that case, prompting of the speaker is preferably performed so as to
minimize inconvenience to the speaker.

Then program flow moves to a state 212. In the state 212, the speech recognition
system 100 is modified. More particularly, the speaker-independent acoustic model utilized in
the state 210 to recognize the speaker's speech is modified based upon the samples of the
speaker's speech taken in the state 210, thereby forming a modified acoustic model.
In the preferred embodiment, the acoustic model is modified prior to termination of
the remote session so that the modified model can immediately be put to use. Alternately, the
acoustic model is modified after termination of the remote session. In the preferred
embodiment, the acoustic model is modified and put to use for speech recognition during the
first and subsequent remote sessions. The acoustic model can also be modified between
remote sessions. Thus, the states 210 and 212 can be performed repeatedly, one after the
other or concurrently, during a single session. For example, assuming a predetermined
amount of speech (e.g., three seconds) is received (state 210), but the remote session has not
yet been terminated, then the acoustic model can be modified (state 212) while a next
predetermined amount of speech is received (state 210). Once the next predetermined amount
of speech is received, the acoustic model is again modified (state 212). For simplicity of
illustration, however, the states 210 and 212 are shown in Fig. 3 as occurring in a simple
succession. Once the session terminates, program flow moves to a state 214.
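
The repeated performance of the states 210 and 212 within a single session can be sketched, purely as an illustration, as a loop that triggers a model update each time a predetermined amount of speech has accumulated. The function name `adapt_in_chunks`, the frame-based representation of speech, and the `adapt` callback are assumptions made for this example; the sketch performs the updates one after the other rather than concurrently.

```python
from typing import Callable, Iterable, List


def adapt_in_chunks(
    frames: Iterable[float],
    frame_seconds: float,
    chunk_seconds: float,
    adapt: Callable[[List[float]], None],
) -> int:
    """Call adapt() each time chunk_seconds of speech has been received.

    Returns the number of model updates performed. Speech left over at the
    end of the session (less than one chunk) is not used by this sketch.
    """
    buffered: List[float] = []
    buffered_seconds = 0.0
    updates = 0
    for frame in frames:
        # State 210: receive speech and keep it as an adaptation sample.
        buffered.append(frame)
        buffered_seconds += frame_seconds
        if buffered_seconds >= chunk_seconds:
            # State 212: modify the acoustic model from the buffered samples.
            adapt(buffered)
            updates += 1
            buffered = []
            buffered_seconds = 0.0
    return updates


# Example: 13 half-second frames with a three-second chunk size.
chunks_seen: List[List[float]] = []
updates = adapt_in_chunks([0.0] * 13, 0.5, 3.0, chunks_seen.append)
```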
In the state 214, a representation of the modified acoustic model, such as the modified
model itself or a set of statistics that can be used to modify a pre-existing acoustic model or
that can be used to modify incoming acoustic speech, is stored in the memory 104 or mass
storage 106 or in a centralized network database. Note that rather than modifying an acoustic
model, the present invention can be utilized to modify measurements of the speech such as
feature vectors to achieve the principal advantages of the present invention. It will be
understood that modification of phonetic features is within the scope of the present invention
and that use of the term "acoustic model" herein includes phonetic features.

Thus, in the preferred embodiment, only a set of statistics which can be used to
modify a pre-existing acoustic model is stored. For example, Fig. 4 illustrates a plurality of
sets of transform statistics for use in conjunction with a pre-existing acoustic model M in
accordance with the present invention. For each speaker, 1 through n, a corresponding set of
statistics X1 through Xn is stored. For each speaker 1 through n, the modified model can be
considered to include the corresponding set of statistics X1 through Xn together with the
pre-existing model M. Only one copy of pre-existing model M need be stored.

Thus, during a subsequent telephone session, speech 300 of speaker 1 is recognized
using the corresponding set of transform statistics X1 in conjunction with the pre-existing
model M to recognize the speaker's speech for forming an output 302. Similarly, speech of
speaker 2 is recognized using the corresponding set of statistics X2 and the same pre-existing
model M to recognize the speaker's speech for forming the output 302. And, speech of
speaker n is recognized using the corresponding set of statistics Xn and the model M to
recognize the speaker's speech for forming an output 302. As a result, memory is conserved
in comparison with the prior technique illustrated in Fig. 1.
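
The arrangement of Fig. 4, in which one shared model M is combined with a small set of statistics per speaker, can be sketched as follows. This is an illustrative sketch only. The form of the statistics is not specified above, so the example assumes, purely for demonstration, that each set of statistics is a running sum and count from which a single feature offset is derived; `BASE_MODEL`, `TransformStats` and `adapted_mean` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

# Pre-existing model M, stored once: a toy mapping from a sound unit to a
# mean feature value (an assumption made for this example).
BASE_MODEL: Dict[str, float] = {"ah": 1.0, "ee": 2.0}


@dataclass
class TransformStats:
    """Per-speaker statistics: running sum and count of observed offsets."""
    offset_sum: float = 0.0
    count: int = 0

    def update(self, offsets: List[float]) -> None:
        # Accumulate new observations without discarding earlier ones.
        self.offset_sum += sum(offsets)
        self.count += len(offsets)

    def bias(self) -> float:
        # With no observations the model is left unmodified.
        return self.offset_sum / self.count if self.count else 0.0


def adapted_mean(unit: str, stats: TransformStats) -> float:
    # The modified model is the shared model M together with the statistics.
    return BASE_MODEL[unit] + stats.bias()


# One small statistics object per speaker; the model itself is never copied.
stats_by_speaker: Dict[str, TransformStats] = {"speaker-1": TransformStats()}
stats_by_speaker["speaker-1"].update([0.5, 0.5])
```

Because only the statistics are stored per speaker, the storage cost per additional speaker in this sketch is two numbers rather than a full copy of the model.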

The modified model, or set of statistics, is stored in association with the identification
of the speaker for utilization for recognizing the speaker's speech in a subsequent session with
the speaker. For example, assume that in the state 206, the speech recognition system 100
looks up the speaker's identification and determines that a sample of the speaker's speech has
previously been obtained. In that case, program flow moves from the state 206 to a state
216.

In the state 216, the modified model or set of statistics stored in response to the
speaker's previous remote session is retrieved from the memory 104 or mass storage 106 to
be utilized for recognizing speech utterances made by the speaker during the current session.
From the state 216, program flow then moves to the state 210. In the state 210, the modified
acoustic model or set of statistics retrieved in the state 216 is utilized for recognizing speech
utterances made by the speaker as the speaker interacts with the speech recognition system
during the remote session. Additional samples of the speaker's speech are taken in the state
210 and utilized to further modify the acoustic model for the speaker.

In this manner, an acoustic model utilized for recognizing the speech of a particular
speaker is cumulatively modified according to speech samples obtained during multiple
remote sessions with the speaker. As a result, the accuracy of the speech recognizing system
improves for the speaker across multiple remote sessions even when the remote sessions are
of relatively short duration.
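
The cumulative effect of the flow of Fig. 3 across sessions can be sketched as follows. The example is a simplified illustration under stated assumptions: the stored representation is reduced to a sample count and a running total, and the names `store` and `run_session` are hypothetical.

```python
from typing import Dict, List

# Hypothetical persistent store: speaker identification -> stored statistics.
store: Dict[str, Dict[str, float]] = {}


def run_session(speaker_id: str, samples: List[float]) -> Dict[str, float]:
    # States 206, 208 and 216: start from the stored statistics if present,
    # otherwise from empty statistics (the speaker-independent case).
    stats = dict(store.get(speaker_id, {"count": 0, "total": 0.0}))
    # States 210 and 212: take samples and fold them into the statistics.
    stats["count"] += len(samples)
    stats["total"] += sum(samples)
    # State 214: store the result in association with the identification.
    store[speaker_id] = stats
    return stats


run_session("caller-42", [1.0, 2.0])      # first, short session
result = run_session("caller-42", [3.0])  # subsequent session builds on it
```

After the second call, the stored statistics reflect all three samples, even though neither session alone supplied more than two.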

Fig. 5 illustrates a flow diagram for adapting an acoustic model utilized for speech
recognition in accordance with an alternate embodiment of the present invention. The flow
diagram of Fig. 5 illustrates graphically operation of the speech recognizing system 100 in
accordance with an alternate embodiment of the present invention. Portions of Fig. 5 which
have a one-to-one functional correspondence with those of Fig. 3 are given the same reference
numeral and are not discussed further.

The flow diagram of Fig. 5 differs from that of Fig. 3 in that from the state 210,
program flow moves to a state 400. In the state 400, a determination is made relative to the
incoming speech utterance. This determination preferably assigns a confidence level related to
the accuracy of the speech recognition performed in the state 210. This can be accomplished
by the speech recognizing system 100 assigning each speech utterance, such as a word, a
phoneme, a phrase or a sentence, a certainty or score, where the assigned certainty or score is
related to the probability that the corresponding identified speech correctly corresponds to the
spoken input, and, then, comparing the certainty or score to one or more predetermined
thresholds. If the speech recognition confidence is consistently extremely high, there may be
no need to modify (or further modify) the acoustic model for the particular speaker. Avoiding
modification of the acoustic model saves processing capacity and memory of
the speech recognition system 100 which can be devoted to other tasks. Conversely, if the
speech recognition accuracy is extremely low, any modifications made to the acoustic model
based upon incorrectly recognized speech utterances or words are not expected to improve the
accuracy of speech recognition based upon such a modified acoustic model. Accordingly, if
the determination made in the state 400 suggests a high accuracy (e.g., the certainty exceeds a
first threshold) or a low accuracy (e.g., the certainty is below a second threshold that is lower
than the first threshold), then program flow returns to the state 202 upon termination of the
remote session. In that case, no modifications to the acoustic model are performed.

PATENT

Atty. Docket No. NUAN-0Q80Q

Alternately, assuming the speech recognition accuracy is determined to be moderate
(e.g., the certainty falls between the first and second thresholds), then it is expected that
modifications to the acoustic model will improve accuracy. In that case, program flow
moves from the state 400 to the state 212. As discussed relative to Fig. 3, in the state 212,
the speaker-independent acoustic model utilized in the state 210 to recognize the speaker's
speech is modified based upon the samples of the speaker's speech taken in the state 210,
thereby forming a modified acoustic model.
In addition, because each portion of an utterance, such as a word or a phoneme, can be
associated with a different confidence level, a single utterance can have several confidence
levels associated with it. Thus, if some levels are above a threshold and others are below,
only those portions having a confidence level above the threshold can be used to update the
model.
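
The determination made in the state 400 can be sketched as a comparison against two thresholds, together with a filter over the portions of an utterance. The sketch is illustrative only: the threshold values, the 0-to-1 confidence scale, and the names `should_adapt` and `portions_for_update` are assumptions chosen for the example and are not specified above.

```python
from typing import List, Tuple

HIGH_THRESHOLD = 0.95  # hypothetical first threshold
LOW_THRESHOLD = 0.50   # hypothetical second threshold, lower than the first


def should_adapt(confidence: float) -> bool:
    # Modify the model only for moderate confidence, that is, neither above
    # the first threshold nor below the second threshold.
    return LOW_THRESHOLD <= confidence <= HIGH_THRESHOLD


def portions_for_update(
    portions: List[Tuple[str, float]], threshold: float = LOW_THRESHOLD
) -> List[str]:
    # Keep only those portions (e.g., words) whose own confidence level is
    # above the threshold; the rest are not used to update the model.
    return [text for text, confidence in portions if confidence > threshold]
```

For example, an utterance scored 0.7 would lead to the state 212, while utterances scored 0.99 or 0.2 would not.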

Note that criteria other than, or in addition to, confidence levels can be utilized for
making the determination in the state 400 of whether or not to modify the acoustic model.
For example, a level of available resources in the speech recognition system 100, such as a
low level of available memory or available processing power, may indicate that modification
of the model is undesirable.

In the state 214, a representation of the modified acoustic model, such as the modified
model itself or a set of statistics that can be used to modify a pre-existing acoustic model, is
stored in the memory 104 or mass storage 106 or in a centralized network database in
association with the identification of the speaker for utilization for recognizing the speaker's
speech in a subsequent remote session with the speaker.

In an alternate embodiment, the determination made in the state 400 can be supervised.
For example, the speech recognition system 100 can inform the speaker of the word or words
it has recognized and, then, ask the speaker to verify whether the speaker's speech has been
correctly recognized. Assuming the speaker confirms that the speaker's speech has been
correctly recognized, then program flow moves from the state 400 to the state 212.
Accordingly, the correctly identified speech utterances or words are utilized to modify the
acoustic model. Conversely, if the speaker indicates that the speech utterances or words were
incorrectly identified, then the acoustic model is not modified based upon such incorrectly
identified speech utterances or words.
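
The supervised determination can be sketched, as an illustration only, as a filter that keeps the utterances the speaker has confirmed. The function name `confirmed_utterances` and the `confirm` callback, which stands in for prompting the speaker and receiving a yes or no response, are hypothetical.

```python
from typing import Callable, List


def confirmed_utterances(
    recognized: List[str], confirm: Callable[[str], bool]
) -> List[str]:
    # Keep only the utterances the speaker confirms were correctly
    # recognized; only these are passed on to modify the acoustic model.
    return [text for text in recognized if confirm(text)]


# Example: the speaker confirms "boston" and rejects "austin".
kept = confirmed_utterances(["boston", "austin"], lambda text: text == "boston")
```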

As described in relation to Figs. 2-5, an acoustic model utilized for recognizing the
speech of a particular speaker is cumulatively modified according to speech samples obtained
during multiple remote sessions with the speaker. As a result, the accuracy of the speech
recognizing system improves for the speaker across multiple remote sessions even when the
sessions are of relatively short duration.

A feature of the present invention provides an acoustic model that uniquely
corresponds to each of a plurality of speakers. During a first remote session with each of the
speakers, the speaker-independent acoustic model is initially utilized. This model is then
modified according to speech samples taken for each particular speaker. Preferably, the
model is modified during the first and subsequent remote sessions and between sessions.
Each modified model is then stored in association with the corresponding speaker's
identification. For subsequent remote sessions, the speech recognizing system 100 retrieves an
appropriate acoustic model from the memory 104 or mass storage 106 based upon the
speaker's identification. Accordingly, each acoustic model is modified based upon samples of
the corresponding speaker's speech across multiple remote sessions with the speaker.

To conserve memory, acoustic models that are specific to a particular speaker can be
deleted from the memory 104 or mass storage 106 when no longer needed. For example,
when a particular speaker has not engaged in a remote session with the service application for
a predetermined period of time, then the acoustic model corresponding to that speaker is
deleted. Should the speaker initiate a remote session after deletion of the acoustic model
corresponding to that speaker, the speaker-independent model is initially utilized and then
modified according to newly acquired samples of the speaker's speech, as described above.
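
Deletion of models after a predetermined period of inactivity can be sketched as follows. The sketch is a minimal illustration: the 90-day period, the use of timestamps in seconds, and the names `RETENTION_SECONDS` and `purge_stale_models` are assumptions made for the example.

```python
from typing import Dict, List

RETENTION_SECONDS = 90 * 24 * 3600  # hypothetical predetermined period (90 days)


def purge_stale_models(
    last_session: Dict[str, float], models: Dict[str, object], now: float
) -> List[str]:
    """Delete the model of any speaker with no session within the period.

    last_session maps a speaker identification to the time, in seconds, of
    that speaker's most recent remote session. Returns the identifications
    whose models were deleted.
    """
    stale = [
        speaker_id
        for speaker_id, timestamp in last_session.items()
        if now - timestamp > RETENTION_SECONDS
    ]
    for speaker_id in stale:
        models.pop(speaker_id, None)
        del last_session[speaker_id]
    return stale


# Example: speaker "a" was last heard 120 days ago, speaker "b" 20 days ago.
DAY = 24 * 3600
models: Dict[str, object] = {"a": "model-a", "b": "model-b"}
last_seen: Dict[str, float] = {"a": 0.0, "b": 100.0 * DAY}
removed = purge_stale_models(last_seen, models, now=120.0 * DAY)
```

A speaker whose model has been purged simply appears as a first-time speaker on the next session, consistent with the behavior described above.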

According to yet another embodiment of the present invention, rather than modifying 
an acoustic model across a plurality of remote sessions based upon speech of an individual 
speaker such that the model is speaker specific, the acoustic model can be modified based 
upon speech of a group of speakers such that the model is speaker-cluster specific. For 
example, speakers from different locales, each locale being associated with a corresponding 
accent (or lack thereof), can be clustered and a model or set of statistics can be stored 
corresponding to each cluster. Thus, speakers from Minnesota can be included in a cluster, 
while speakers from Georgia can be included in another cluster. As an example, when the 
remote connection is via telephone, the speaker's telephone area code can be used to place the 
speaker into an appropriate cluster. It will be apparent that clusters can be based upon criteria 
other than locale. 
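
Placement of a telephone caller into a cluster by area code can be sketched as a simple table lookup. The example is illustrative only: the mapping table, the cluster names, the fallback cluster, and the function name `cluster_for_caller` are assumptions, and the number parsing assumes a North American ten-digit number with an optional leading country code.

```python
from typing import Dict

# Hypothetical mapping from telephone area code to cluster. Area codes 612
# and 404 are used as examples for Minnesota and Georgia respectively.
AREA_CODE_TO_CLUSTER: Dict[str, str] = {"612": "minnesota", "404": "georgia"}
DEFAULT_CLUSTER = "general"  # assumed fallback for unmapped area codes


def cluster_for_caller(phone_number: str) -> str:
    # Keep only the digits of the caller's number.
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    # Drop a leading country code "1" from an eleven-digit number.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    # The first three remaining digits are taken as the area code.
    return AREA_CODE_TO_CLUSTER.get(digits[:3], DEFAULT_CLUSTER)
```

The cluster name returned here would then play the role that the speaker's identification plays in the speaker-specific embodiment, serving as the key under which a model or set of statistics is stored and retrieved.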

The present invention has been described in terms of specific embodiments 
incorporating details to facilitate the understanding of principles of construction and operation 
of the invention. Such reference herein to specific embodiments and details thereof is not 
intended to limit the scope of the claims appended hereto. It will be apparent to those skilled 
in the art that modifications may be made in the embodiment chosen for illustration without 
departing from the spirit and scope of the invention.

Claims

What is claimed is: 



1. A method of adapting a speech recognition system, wherein the method
comprises steps of:
a. obtaining a sample of a speaker's speech during a first remote session;
b. recognizing the speaker's speech utilizing the speech recognition system during
the first remote session;
c. modifying the speech recognition system according to the sample thereby
forming a modified speech recognition system;
d. storing a representation of the modified speech recognition system in
association with an identification of the speaker; and
e. using the representation of the modified speech recognition system to recognize
speech during a subsequent remote session with the speaker.



2. The method according to claim 1 further comprising a step of cumulatively
modifying the speech recognition system according to speech samples obtained during one or
more remote sessions with the speaker.

3. The method according to claim 1 wherein the speaker is a telephone caller.

4. The method according to claim 1 wherein the step of modifying the speech
recognition system comprises a step of modifying an acoustic model thereby forming a
modified acoustic model and wherein the step of storing a representation of the modified
speech recognition system comprises a step of storing a representation of the modified
acoustic model.

5. The method according to claim 4 wherein the representation of the modified
acoustic model is a set of statistics which can be utilized to modify a pre-existing acoustic
model.

6. The method according to claim 4 wherein the representation of the modified
acoustic model is a set of statistics which can be utilized to modify incoming acoustic speech.

7. The method according to claim 1 further comprising a step of utilizing the
modified speech recognition system during the first remote session with the speaker.

8. The method according to claim 1 wherein the speech recognition system is
speaker-independent prior to the first remote session.

9. The method according to claim 1 wherein the step of modifying the speech
recognition system is performed during the first remote session.

10. The method according to claim 1 wherein the step of modifying the speech
recognition system is performed after termination of the first remote session.

11. The method according to claim 1 further comprising a step of obtaining the
identification of the speaker during the first remote session.

12. The method according to claim 11 further comprising a step of authenticating
the speaker's identification by the speaker's speech.

13. The method according to claim 2 wherein the speech recognition system is
speaker-independent prior to the first remote session.

14. The method according to claim 2 wherein the step of modifying the speech
recognition system is performed during the first remote session.

15. The method according to claim 2 wherein the step of modifying the speech
recognition system is performed after termination of the first remote session.

16. The method according to claim 2 further comprising a step of authenticating
the speaker's identification by the speaker's speech.

17. A method of adapting a speech recognition system, wherein the method
comprises steps of:
a. obtaining a sample of a speaker's speech during a first remote session;
b. recognizing the speaker's speech utilizing the speech recognition system during
the first remote session;
c. modifying the speech recognition system according to the sample thereby
forming a modified speech recognition system;
d. storing a representation of the modified speech recognition system in
association with an identification of a cluster of speakers wherein the speaker is
a member of the cluster; and
e. using the representation of the modified speech recognition system to recognize
speech during a subsequent remote session with a member of the cluster of
speakers.

18. The method according to claim 17 further comprising a step of cumulatively
modifying the speech recognizing system according to speech samples obtained during one or
more remote sessions with one or more members of the cluster of speakers.

19. The method according to claim 17 wherein the speaker is a telephone caller.

1 20. The method according to claim 17 wherein the step of modifying the speech 

2 recognition system comprises a step of modifying an acoustic model thereby forming a 

3 modified acoustic model and wherein the step of storing a representation of the modified 

4 speech recognition system comprises a step of storing a representation of the modified 

5 acoustic model. 

1 21. The method according to claim 20 wherein the representation of the modified 

2 acoustic model is a set of statistics which can be utilized to modify a pre-existing acoustic 
3 model. 

1 22. The method according to claim 20 wherein the representation of the modified 

2 acoustic model is a set of statistics which can be utilized to modify incoming acoustic speech. 

1 23. The method according to claim 17 further comprising a step of utilizing the 

2 modified speech recognition system during the first remote session with the speaker. 

1 24. The method according to claim 17 wherein the speech recognition system is 

2 speaker-independent prior to the first remote session. 

1 25. The method according to claim 17 wherein the step of modifying the speech 

2 recognition system is performed during the first remote session. 

1 26. The method according to claim 17 wherein the step of modifying the speech 

2 recognition system is performed after termination of the first remote session. 
1 27. The method according to claim 17 further comprising a step of 

2 identifying the cluster of which the speaker is a member during the first remote session. 
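
The cluster identification recited in claims 17 and 27 can be illustrated with a minimal sketch. The claims do not say how a speaker is assigned to a cluster, so the nearest-centroid rule, the function name `nearest_cluster`, and the centroid dictionary below are all assumptions made for illustration only.

```python
def nearest_cluster(sample_mean, centroids):
    """Return the id of the cluster whose centroid is closest to the sample.

    sample_mean: average feature vector of the speaker's speech sample.
    centroids:   hypothetical mapping of cluster id -> centroid vector.
    Nearest-centroid assignment is an assumed rule, not taken from the claims.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda cid: squared_distance(sample_mean, centroids[cid]))


# Adapted representations are then keyed by cluster id rather than by speaker,
# so any later caller assigned to the same cluster reuses the same entry.
representations_by_cluster = {}
cluster_id = nearest_cluster([1.0, 2.0], {"low": [0.0, 0.0], "high": [10.0, 10.0]})
representations_by_cluster[cluster_id] = "adapted-model-placeholder"
```

Keying the store by cluster rather than by individual speaker means the number of stored representations is bounded by the number of clusters, at the cost of a less specific model for each caller.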

1 28. The method according to claim 18 wherein the speech recognition system is 

2 speaker-independent prior to the first remote session. 

1 29. The method according to claim 18 wherein the step of modifying the speech 

2 recognition system is performed during the first remote session. 

1 30. The method according to claim 18 wherein the step of modifying the speech 

2 recognition system is performed after termination of the first remote session. 

1 31. The method according to claim 18 further comprising a step of authenticating 

2 the speaker's identification by the speaker's speech. 

1 32. A method of adapting a speech recognition system, wherein the method 

2 comprises steps of: 

3 a. obtaining a sample of speech made by each of a plurality of speakers during a 

4 corresponding first remote session with each speaker; 

5 b. recognizing speech made by each speaker during the corresponding first remote 

6 session utilizing the speech recognition system configured to be speaker- 

7 independent; 

8 c. modifying the speech recognition system according to the sample from each 

9 speaker thereby forming a modified speech recognition system corresponding to 

10 each speaker; 

11 d. storing a representation of the modified speech recognition system 

12 corresponding to each speaker in association with an identification of the 



13 corresponding speaker; and 

14 e. using the representation of the modified speech recognition system 

15 corresponding to a speaker to recognize speech during a subsequent remote 

16 session with the speaker. 

1 33. The method according to claim 32 further comprising a step of cumulatively 

2 modifying the speech recognition system for each speaker according to speech samples 

3 obtained during one or more remote sessions with the corresponding speaker. 

1 34. The method according to claim 32 wherein each of the plurality of speakers is 
2 a telephone caller. 

1 35. The method according to claim 32 wherein the step of modifying the speech 

2 recognition system comprises a step of modifying an acoustic model thereby forming a 

3 modified acoustic model corresponding to each speaker and wherein the step of storing a 

4 representation of the modified speech recognition system comprises a step of storing a 

5 representation of the modified acoustic model corresponding to each speaker. 

1 36. The method according to claim 35 wherein the representation of the modified 

2 acoustic model corresponding to each speaker is a set of statistics which can be utilized to 

3 modify a pre-existing acoustic model. 

1 37. The method according to claim 35 wherein the representation of the modified 

2 acoustic model corresponding to each speaker is a set of statistics which can be utilized to 

3 modify incoming acoustic speech. 
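
Claims 36 and 37 distinguish two uses of the stored statistics: modifying a pre-existing acoustic model, or modifying the incoming acoustic speech. The sketch below illustrates the difference under a deliberately simple assumption, namely that the stored statistic is a single mean offset between the speaker's features and a speaker-independent global mean. The function names and the offset representation are hypothetical; the claims do not prescribe them.

```python
def speaker_offset(frames, si_global_mean):
    """Compute the stored statistic: speaker mean minus speaker-independent mean."""
    n = len(frames)
    dims = len(si_global_mean)
    speaker_mean = [sum(frame[i] for frame in frames) / n for i in range(dims)]
    return [s - g for s, g in zip(speaker_mean, si_global_mean)]


def shift_model_means(model_means, offset):
    """Claim 36 style: apply the statistic to the pre-existing model parameters."""
    return [[m + o for m, o in zip(mean, offset)] for mean in model_means]


def normalize_features(frames, offset):
    """Claim 37 style: apply the statistic to the incoming speech features,
    leaving the speaker-independent model untouched."""
    return [[x - o for x, o in zip(frame, offset)] for frame in frames]


offset = speaker_offset([[1.0, 2.0], [3.0, 4.0]], si_global_mean=[1.0, 1.0])
adapted_means = shift_model_means([[0.0, 0.0], [5.0, 5.0]], offset)
normalized = normalize_features([[2.0, 3.0]], offset)
```

In either case only the small offset vector needs to be stored per speaker, rather than a full copy of the acoustic model.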

1 38. The method according to claim 32 further comprising a step of utilizing the 
2 modified speech recognition system corresponding to each speaker during the first remote 

3 session with the corresponding speaker. 

1 39. The method according to claim 32 wherein the step of modifying the speech 

2 recognition system for each speaker is performed during the first remote session with the 

3 corresponding speaker. 

1 40. The method according to claim 32 wherein the step of modifying the speech 

2 recognition system for each speaker is performed after termination of the first remote session 

3 with the corresponding speaker. 

1 41. The method according to claim 32 further comprising a step of obtaining the 

2 identification of each speaker during the first remote session with the speaker. 

1 42. The method according to claim 41 further comprising a step of authenticating 

2 each speaker's identification by the speaker's speech. 

1 43. The method according to claim 33 wherein the step of modifying the speech 

2 recognition system for each speaker is performed during the first remote session with the 

3 corresponding speaker. 

1 44. The method according to claim 33 wherein the step of modifying the speech 

2 recognition system for each speaker is performed after termination of the first remote session 

3 with the corresponding speaker. 

1 45. The method according to claim 33 further comprising a step of authenticating 

2 each speaker's identification by the speaker's speech. 
1 46. The method according to claim 32 further comprising a step of deleting the 

2 representation of a modified speech recognition system corresponding to a speaker. 



1 47. The method according to claim 46 wherein the step of deleting the 

2 representation of a modified speech recognition system corresponding to a speaker is 

3 performed when a predetermined period of time has elapsed since the corresponding speaker 

4 last engaged in a remote session. 
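
The time-based deletion of claims 46 and 47 can be sketched as a small cache that records when each speaker last engaged in a session. The class name `ModelCache`, the `max_idle_seconds` parameter, and the explicit `now` argument (used so the behaviour can be exercised without waiting) are assumptions for illustration and do not come from the claims.

```python
import time


class ModelCache:
    """Stores one representation per speaker with the time of the last session."""

    def __init__(self, max_idle_seconds):
        # The "predetermined period of time" of claim 47 (assumed to be in seconds).
        self.max_idle_seconds = max_idle_seconds
        self._entries = {}  # speaker_id -> (representation, last_session_time)

    def store(self, speaker_id, representation, now=None):
        """Save the representation and stamp it with the session time."""
        now = time.time() if now is None else now
        self._entries[speaker_id] = (representation, now)

    def get(self, speaker_id):
        """Return the stored representation, or None if there is none."""
        entry = self._entries.get(speaker_id)
        return None if entry is None else entry[0]

    def purge_stale(self, now=None):
        """Delete every representation idle for longer than the allowed period."""
        now = time.time() if now is None else now
        stale = [
            speaker_id
            for speaker_id, (_, last_seen) in self._entries.items()
            if now - last_seen > self.max_idle_seconds
        ]
        for speaker_id in stale:
            del self._entries[speaker_id]
        return stale
```

A speaker whose entry has been purged would simply be treated as a new speaker at the next session, starting again from the speaker-independent configuration.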



1 48. A speech recognition system comprising: 

2 a. an interface coupled to receive a remote session from a speaker; and 

3 b. a processing system coupled to the interface to recognize the speaker's speech 

4 wherein the processing system is cumulatively modified according to speech 

5 samples obtained during a plurality of remote sessions with the speaker. 

1 49. The speech recognition system according to claim 48 wherein the speaker is a 

2 telephone caller. 

1 50. The speech recognition system according to claim 48 wherein the processing 

2 system is modified by modifying an acoustic model. 

1 51. The speech recognition system according to claim 50 wherein the processing 

2 system includes a memory for storing the acoustic model in association with an identification 



3 of the telephone caller. 



1 52. The speech recognition system according to claim 51 wherein the memory 

2 stores a plurality of acoustic models, one for each of a plurality of telephone callers and 

3 wherein each acoustic model is stored in association with an identification of the 



4 corresponding telephone caller. 

1 53. The speech recognition system according to claim 52 wherein the selected ones 

2 of the plurality of acoustic models are deleted when a predetermined period of time has 

3 elapsed since the corresponding speaker last engaged in a remote session with the voice 

4 recognizer. 

1 54. A method of adapting an acoustic model utilized for speech recognition, 

2 wherein the method comprises steps of: 

3 a. obtaining a speech utterance from a speaker during a remote session; 

4 b. recognizing the speaker's speech utilizing an acoustic model during the remote 

5 session; 

6 c. making a determination relative to the speech utterance; and 

7 d. only when indicated by the determination, performing steps of: 

8 i. modifying the acoustic model according to the speech utterance thereby 

9 forming a modified acoustic model; and 

10 ii. storing a representation of the modified acoustic model in association 

11 with an identification of the speaker. 

1 55. The method according to claim 54 wherein the step of making the 

2 determination assigns a confidence level to the speech utterance. 

1 56. The method according to claim 54 wherein the step of making the 

2 determination assigns a confidence level to each of a plurality of portions of the speech 

3 utterance. 

1 57. The method according to claim 54 wherein the step of making a determination 
2 determines a level of resources available for storing the representation of the modified 

3 acoustic model. 

1 58. The method according to claim 54 wherein the step of making a determination 

2 determines a level of processing resources available for performing the step of modifying the 

3 acoustic model. 
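
The determinations of claims 55 through 58 can be sketched as simple threshold checks that gate the adaptation step of claim 54. The claims present each determination separately; combining them in one conjunction, and all of the threshold values and parameter names below, are assumptions made for illustration.

```python
def should_adapt(confidence, free_storage_bytes, cpu_idle_fraction,
                 min_confidence=0.8, min_storage_bytes=1_000_000,
                 min_cpu_idle=0.2):
    """Return True only when every assumed condition for adapting is met.

    confidence:          confidence level assigned to the utterance (claim 55).
    free_storage_bytes:  storage available for the representation (claim 57).
    cpu_idle_fraction:   processing capacity available for modifying (claim 58).
    """
    return (
        confidence >= min_confidence
        and free_storage_bytes >= min_storage_bytes
        and cpu_idle_fraction >= min_cpu_idle
    )


def select_portions(portions, min_confidence=0.8):
    """Keep only the portions of an utterance that were recognized with
    sufficient confidence (claim 56). `portions` is a list of
    (text, confidence) pairs, a hypothetical representation."""
    return [text for text, confidence in portions if confidence >= min_confidence]
```

Gating on confidence guards against adapting the model toward a misrecognized utterance, while the resource checks let a heavily loaded system skip adaptation and continue recognizing with the existing model.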






PATENT 

Atty. Docket No. NUAN-00800 



Abstract 

A technique for adaptation of a speech recognition system across multiple remote 
communication sessions with a speaker. The speaker can be a telephone caller. An acoustic 
model is utilized for recognizing the speaker's speech. Upon initiation of a first remote 
session with the speaker, the acoustic model is speaker-independent. During the first session, 
the speaker is uniquely identified and speech samples are obtained from the speaker. In the 
preferred embodiment, the samples are obtained without requiring the speaker to engage in a 
training session. The acoustic model is then modified based upon the samples thereby 
forming a modified model. The model can be modified during the session or after the session 
is terminated. Upon termination of the session, the modified model is then stored in 
association with an identification of the speaker. During a subsequent remote session, the 
speaker is identified and the modified acoustic model is then utilized to recognize the 
speaker's speech. Additional speech samples are obtained during the subsequent session and 
are then utilized to further modify the acoustic model. In this manner, an acoustic model utilized 
for recognizing the speech of a particular speaker is cumulatively modified according to 
speech samples obtained during multiple sessions with the speaker. As a result, the accuracy 
of the speech recognition system improves for the speaker even when the speaker only 
engages in relatively short remote sessions. 
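
The cumulative adaptation summarized above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the acoustic model is reduced to a single mean vector, the update is a MAP-style interpolation chosen here for simplicity, and the names `AdaptationStats`, `SpeakerStore`, and the prior weight `tau` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptationStats:
    """Running statistics accumulated for one speaker across sessions."""
    count: float = 0.0
    feature_sum: list = field(default_factory=list)

    def accumulate(self, frames):
        """Add the feature vectors from one session to the running totals."""
        for frame in frames:
            if not self.feature_sum:
                self.feature_sum = [0.0] * len(frame)
            for i, value in enumerate(frame):
                self.feature_sum[i] += value
            self.count += 1.0


def adapted_mean(si_mean, stats, tau=10.0):
    """Interpolate the speaker-independent mean toward the speaker's data.

    tau is an assumed prior weight: with few frames the result stays near
    the speaker-independent mean; with many frames it approaches the
    speaker's own average.
    """
    if stats.count == 0:
        return list(si_mean)
    return [(tau * m + s) / (tau + stats.count)
            for m, s in zip(si_mean, stats.feature_sum)]


class SpeakerStore:
    """Keeps adaptation statistics in association with a speaker identification."""

    def __init__(self, si_mean):
        self.si_mean = si_mean
        self.stats_by_speaker = {}

    def model_for(self, speaker_id):
        """Model used at the start of a session; speaker-independent if unknown."""
        stats = self.stats_by_speaker.get(speaker_id, AdaptationStats())
        return adapted_mean(self.si_mean, stats)

    def end_session(self, speaker_id, frames):
        """Fold the session's samples into the speaker's stored statistics."""
        stats = self.stats_by_speaker.setdefault(speaker_id, AdaptationStats())
        stats.accumulate(frames)
```

Because only running sums and a count are stored, each short session adds to the evidence gathered in earlier sessions, and the model returned by `model_for` drifts further from the speaker-independent starting point as sessions accumulate.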